    Optimal Encodings for Range Min-Max and Top-k

    In this paper we consider various encoding problems for range queries on arrays. In these problems, the goal is that the encoding occupies the information theoretic minimum space required to answer a particular set of range queries. Given an array $A[1..n]$, a range top-$k$ query on an arbitrary range $[i,j] \subseteq [1,n]$ asks us to return the ordered set of indices $\{l_1, \ldots, l_k\}$ such that $A[l_m]$ is the $m$-th largest element in $A[i..j]$. We present optimal encodings for range top-$k$ queries, as well as for a new problem which we call range min-max, in which the goal is to return the indices of both the minimum and maximum element in a range.
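
    To make the two query types concrete, the following brute-force sketch answers them directly from the array. It is only a reference for the query semantics, not the space-optimal encoding described above; the function names and the tie-breaking rule (smaller index first) are assumptions made for illustration.

```python
def range_top_k(A, i, j, k):
    """Indices (1-based) of the k largest elements of A[i..j], largest first."""
    # Assumption: ties are broken in favour of the smaller index.
    ranked = sorted(range(i, j + 1), key=lambda l: (-A[l - 1], l))
    return ranked[:k]


def range_min_max(A, i, j):
    """Indices (1-based) of the minimum and the maximum element of A[i..j]."""
    idx = range(i, j + 1)
    lo = min(idx, key=lambda l: A[l - 1])
    hi = max(idx, key=lambda l: A[l - 1])
    return lo, hi


A = [3, 9, 2, 7, 5]
print(range_top_k(A, 2, 5, 2))
print(range_min_max(A, 1, 4))
```

    Each query here costs time proportional to the range length and needs the array itself, which is exactly what an encoding is meant to avoid.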

    Encodings of Range Maximum-Sum Segment Queries and Applications

    Given an array A containing arbitrary (positive and negative) numbers, we consider the problem of supporting range maximum-sum segment queries on A: i.e., given an arbitrary range [i,j], return the subrange [i',j'] \subseteq [i,j] such that the sum of the numbers in A[i'..j'] is maximized. Chen and Chao [Disc. App. Math. 2007] presented a data structure for this problem that occupies \Theta(n) words, can be constructed in \Theta(n) time, and supports queries in \Theta(1) time. Our first result is that if only the indices [i',j'] are desired (rather than the maximum sum achieved in that subrange), then it is possible to reduce the space to \Theta(n) bits, regardless of the numbers stored in A, while retaining the same construction and query time. We also improve the best known space lower bound for any data structure that supports range maximum-sum segment queries from n bits to 1.89113n - \Theta(lg n) bits, for sufficiently large values of n. Finally, we provide a new application of this data structure which simplifies a previously known linear time algorithm for finding k-covers: i.e., given an array A of n numbers and a number k, find k disjoint subranges [i_1,j_1],...,[i_k,j_k], such that the total sum of all the numbers in the subranges is maximized.
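
    A single range maximum-sum segment query can be answered without any preprocessing by running Kadane's scan over the queried range. The sketch below does that; it is a linear-time-per-query baseline, not the \Theta(1)-time structure discussed above, and the function name and the choice to return a non-empty segment are assumptions.

```python
def range_max_sum_segment(A, i, j):
    """Return (i2, j2, s): a non-empty subrange [i2, j2] of [i, j] (1-based)
    whose sum s is maximum, found by Kadane's scan over A[i..j]."""
    best_sum = A[i - 1]
    best = (i, i)
    cur_sum = 0
    cur_start = i
    for p in range(i, j + 1):
        if cur_sum <= 0:
            # A non-positive running sum cannot help: restart the segment at p.
            cur_sum = A[p - 1]
            cur_start = p
        else:
            cur_sum += A[p - 1]
        if cur_sum > best_sum:
            best_sum = cur_sum
            best = (cur_start, p)
    return best[0], best[1], best_sum


A = [2, -3, 4, -1, 2, -5, 3]
print(range_max_sum_segment(A, 1, 7))
print(range_max_sum_segment(A, 4, 7))
```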

    Weighted ancestors in suffix trees

    The classical, ubiquitous, predecessor problem is to construct a data structure for a set of integers that supports fast predecessor queries. Its generalization to weighted trees, a.k.a. the weighted ancestor problem, has been extensively explored and successfully reduced to the predecessor problem. It is known that any solution for both problems with an input set from a polynomially bounded universe that preprocesses a weighted tree in O(n polylog(n)) space requires \Omega(log log n) query time. Perhaps the most important and frequent application of the weighted ancestors problem is for suffix trees. It has been a long-standing open question whether the weighted ancestors problem has better bounds for suffix trees. We answer this question positively: we show that a suffix tree built for a text w[1..n] can be preprocessed using O(n) extra space, so that queries can be answered in O(1) time. Thus we improve the running times of several applications. Our improvement is based on a number of data structure tools and a periodicity-based insight into the combinatorial structure of a suffix tree. Comment: 27 pages, LNCS format. A condensed version will appear in ESA 201
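
    The query being discussed can be illustrated with a naive baseline that walks up parent pointers. This is not the O(1)-time structure claimed above. The dictionaries `parent` and `weight`, the function name, and the query convention (weights strictly increase away from the root, and the answer is the ancestor closest to the root whose weight is at least w) are assumptions chosen for illustration; other formulations of the problem exist.

```python
def weighted_ancestor(parent, weight, v, w):
    """Return the ancestor u of v (possibly v itself) closest to the root with
    weight[u] >= w, or None if even v is lighter than w.

    Assumes weights strictly increase on every root-to-leaf path.
    """
    if weight[v] < w:
        return None
    u = v
    # Climb while the parent still satisfies the weight threshold.
    while parent[u] is not None and weight[parent[u]] >= w:
        u = parent[u]
    return u


# A path 0 - 1 - 2 - 3 rooted at node 0, with hypothetical weights.
parent = {0: None, 1: 0, 2: 1, 3: 2}
weight = {0: 0, 1: 2, 2: 5, 3: 9}
print(weighted_ancestor(parent, weight, 3, 4))
```

    In a suffix tree the weight of a node would be its string depth, so such a query locates the node (or edge) spelling a given substring of the text; the walk above takes time proportional to the tree height.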

    Compressed Membership for NFA (DFA) with Compressed Labels is in NP (P)

    In this paper, a compressed membership problem for finite automata, both deterministic and non-deterministic, with compressed transition labels is studied. The compression is represented by straight-line programs (SLPs), i.e. context-free grammars generating exactly one string. A novel technique of dealing with SLPs is introduced: the SLPs are recompressed, so that substrings of the input text are encoded in SLPs labelling the transitions of the NFA (DFA) in the same way as in the SLP representing the input text. To this end, the SLPs are locally decompressed and then recompressed in a uniform way. Furthermore, such recompression induces only small changes in the automaton; in particular, the size of the automaton remains polynomial. Using this technique it is shown that the compressed membership for NFA with compressed labels is in NP, thus confirming the conjecture of Plandowski and Rytter and extending the partial result of Lohrey and Mathissen; as it is already known that this problem is NP-hard, we settle its exact computational complexity. Moreover, the same technique applied to the compressed membership for DFA with compressed labels yields that this problem is in P; for this problem, only the trivial PSPACE upper bound was known.
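
    As a small illustration of what an SLP is, the sketch below represents one as a dictionary of rules and computes the length of the generated string without expanding it. It does not implement the recompression technique; the rule format (a terminal rule maps to a string, a binary rule maps to a pair of nonterminals) and all names are assumptions made for this example.

```python
from functools import lru_cache


def slp_length(rules, start):
    """Length of the unique string derived from `start`, computed bottom-up
    on the grammar rather than on the (possibly exponentially long) string."""
    @lru_cache(maxsize=None)
    def length(sym):
        rhs = rules[sym]
        if isinstance(rhs, str):   # terminal rule  X -> a
            return len(rhs)
        left, right = rhs          # binary rule    X -> Y Z
        return length(left) + length(right)

    return length(start)


def slp_expand(rules, start):
    """Fully decompress the SLP (only sensible for small examples)."""
    rhs = rules[start]
    if isinstance(rhs, str):
        return rhs
    return slp_expand(rules, rhs[0]) + slp_expand(rules, rhs[1])


# Hypothetical SLP with 5 rules generating "abababab".
rules = {"A": "a", "B": "b", "X1": ("A", "B"), "X2": ("X1", "X1"), "X3": ("X2", "X2")}
print(slp_length(rules, "X3"), slp_expand(rules, "X3"))
```

    Each extra doubling rule doubles the generated string, which is why algorithms on SLP-compressed inputs generally cannot afford to decompress.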

    A Combinatorial Approach to Collapsing Words

    RLE Edit Distance in Near Optimal Time

    We show that the edit distance between two run-length encoded strings of compressed lengths m and n, respectively, can be computed in O(mn log(mn)) time. This improves the previous record by a factor of O(n/log(mn)). The running time of our algorithm is within subpolynomial factors of being optimal, subject to the standard SETH-hardness assumption. This effectively closes a line of algorithmic research that started in 1993.
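
    For reference, the problem can be solved naively by decoding both run-length encodings and running the textbook dynamic program on the decoded strings. The sketch below does exactly that; its running time depends on the uncompressed lengths, so it is a correctness baseline rather than the O(mn log(mn)) algorithm. The run format (a list of character/count pairs) and the function names are assumptions.

```python
def rle_decode(runs):
    """Decode a list of (character, count) runs into a plain string."""
    return "".join(ch * cnt for ch, cnt in runs)


def edit_distance(s, t):
    """Unit-cost edit distance (insert, delete, substitute), two-row DP."""
    prev = list(range(len(t) + 1))
    for a_idx, a in enumerate(s, start=1):
        cur = [a_idx]
        for b_idx, b in enumerate(t, start=1):
            cur.append(min(prev[b_idx] + 1,               # delete a
                           cur[b_idx - 1] + 1,            # insert b
                           prev[b_idx - 1] + (a != b)))   # match or substitute
        prev = cur
    return prev[-1]


def rle_edit_distance_naive(runs_s, runs_t):
    """Edit distance of two run-length encoded strings via full decoding."""
    return edit_distance(rle_decode(runs_s), rle_decode(runs_t))


print(rle_edit_distance_naive([("a", 3), ("b", 2)], [("a", 2), ("b", 3)]))
```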

    Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad-hoc measures are employed to estimate the repetitiveness of strings, e.g., the size $z$ of the Lempel-Ziv parse or the number $r$ of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size $\gamma$ of a smallest string attractor. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing $\gamma$ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure that is based on the function $S_T$ counting the cardinalities of the sets of substrings of each length of $T$, also known as the substring complexity. This new measure is defined as $\delta = \sup\{S_T(k)/k, k \geq 1\}$ and lower bounds all the measures previously considered. In particular, $\delta \leq \gamma$ always holds and $\delta$ can be computed in $\mathcal{O}(n)$ time using $\Omega(n)$ working space. Kociumaka et al. showed that if $\delta$ is given, one can construct an $\mathcal{O}(\delta \log \frac{n}{\delta})$-sized representation of $T$ supporting efficient direct access and efficient pattern matching queries on $T$. Given that for highly compressible strings, $\delta$ is significantly smaller than $n$, it is natural to pose the following question: Can we compute $\delta$ efficiently using sublinear working space? It is straightforward to show that any algorithm computing $\delta$ using $\mathcal{O}(b)$ space requires $\Omega(n^{2-o(1)}/b)$ time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We present the following results: an $\mathcal{O}(n^3/b^2)$-time and $\mathcal{O}(b)$-space algorithm to compute $\delta$, for any $b \in [1,n]$; and an $\tilde{\mathcal{O}}(n^2/b)$-time and $\mathcal{O}(b)$-space algorithm to compute $\delta$, for any $b \in [n^{2/3},n]$.
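
    The definition of the measure can be evaluated directly on small inputs. The sketch below counts the distinct substrings of every length with hash sets and takes the maximum ratio; it uses far more than sublinear space and is meant only to pin down the definition, not to reproduce any of the algorithms above. Function names are assumptions, and the input is assumed to be non-empty.

```python
from fractions import Fraction


def substring_complexity(T):
    """Return [S_T(1), ..., S_T(n)], where S_T(k) is the number of distinct
    length-k substrings of T."""
    n = len(T)
    return [len({T[p:p + k] for p in range(n - k + 1)}) for k in range(1, n + 1)]


def delta(T):
    """delta = max over k >= 1 of S_T(k) / k, as an exact fraction.
    Only k <= n matters, since S_T(k) = 0 for k > n."""
    S = substring_complexity(T)
    return max(Fraction(s, k) for k, s in enumerate(S, start=1))


print(substring_complexity("abababab"))
print(delta("abababab"), delta("abcd"))
```

    On the periodic string "abababab" every length up to 7 has exactly two distinct substrings, so the maximum ratio is attained at k = 1, while on "abcd" all four letters are distinct and the ratio at k = 1 already equals the string length.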

    On Maximal Unbordered Factors

    Given a string $S$ of length $n$, its maximal unbordered factor is the longest factor which does not have a border. In this work we investigate the relationship between $n$ and the length of the maximal unbordered factor of $S$. We prove that for the alphabet of size $\sigma \ge 5$ the expected length of the maximal unbordered factor of a string of length $n$ is at least $0.99n$ (for sufficiently large values of $n$). As an application of this result, we propose a new algorithm for computing the maximal unbordered factor of a string. Comment: Accepted to the 26th Annual Symposium on Combinatorial Pattern Matching (CPM 2015).
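
    A quadratic-time baseline makes the definition concrete: for every starting position, the prefix function (border array) of the suffix tells which prefixes of that suffix have no border. This is not the algorithm proposed in the paper; the function names and the choice to return the leftmost factor among those of maximum length are assumptions.

```python
def prefix_function(s):
    """pi[q] = length of the longest border (proper prefix that is also a
    suffix) of s[0..q]."""
    pi = [0] * len(s)
    for q in range(1, len(s)):
        k = pi[q - 1]
        while k > 0 and s[q] != s[k]:
            k = pi[k - 1]
        if s[q] == s[k]:
            k += 1
        pi[q] = k
    return pi


def maximal_unbordered_factor(S):
    """Longest factor of S with no border (leftmost one on ties)."""
    best = ""
    for i in range(len(S)):
        pi = prefix_function(S[i:])
        # Only lengths that would beat the current best are worth checking.
        for length in range(len(pi), len(best), -1):
            if pi[length - 1] == 0:    # S[i:i+length] has no border
                best = S[i:i + length]
                break
    return best


print(maximal_unbordered_factor("abaab"))
```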

    Substring Complexity in Sublinear Space

    Shannon’s entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size z of the Lempel–Ziv parse or the number r of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size γ of a smallest string attractor. Let T be a string of length n. A string attractor of T is a set of positions of T capturing the occurrences of all the substrings of T. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function S_T(k) counting the number of distinct substrings of length k of T, also known as the substring complexity of T. This new measure is defined as δ = sup{S_T(k)/k, k ≥ 1} and lower bounds all the relevant ad hoc measures previously considered. In particular, δ ≤ γ always holds and δ can be computed in O(n) time using Θ(n) working space. Kociumaka et al. showed that one can construct an O(δ log(n/δ))-sized representation of T supporting efficient direct access and efficient pattern matching queries on T. Given that for highly compressible strings, δ is significantly smaller than n, it is natural to pose the following question: Can we compute δ efficiently using sublinear working space? It is straightforward to show that in the comparison model, any algorithm computing δ using O(b) space requires Ω(n^{2-o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We thus wanted to investigate whether we can indeed match this lower bound. We address this algorithmic challenge by showing the following bounds to compute δ: (i) O((n^3 log b)/b^2) time using O(b) space, for any b ∈ [1,n], in the comparison model; and (ii) Õ(n^2/b) time using Õ(b) space, for any b ∈ [√n,n], in the word RAM model. This gives an Õ(n^{1+ε})-time and Õ(n^{1-ε})-space algorithm to compute δ, for any 0 < ε ≤ 1/2. Let us remark that our algorithms compute S_T(k), for all k, within the same complexities.
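
    The notion of a string attractor can be checked by brute force on short strings. The sketch below uses the usual formal reading of "capturing the occurrences of all the substrings": every distinct substring of T must have at least one occurrence that spans a position of the set. The function name and the use of 0-based positions are assumptions, and the check is cubic or worse, so it is for illustration only.

```python
def is_string_attractor(T, positions):
    """True if every distinct substring of T has an occurrence T[i..j]
    (0-based, inclusive) containing some position from `positions`."""
    n = len(T)
    gamma = sorted(set(positions))
    covered = set()
    all_substrings = set()
    for i in range(n):
        for j in range(i, n):
            sub = T[i:j + 1]
            all_substrings.add(sub)
            # This particular occurrence counts if it crosses a chosen position.
            if any(i <= p <= j for p in gamma):
                covered.add(sub)
    return all_substrings <= covered


print(is_string_attractor("abab", [0, 1]))
print(is_string_attractor("abab", [0]))
```

    For "abab" the set {0, 1} works, while {0} fails because no occurrence of "b" contains position 0; any attractor needs at least one position per distinct letter.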

    The Dynamic k-Mismatch Problem

    The text-to-pattern Hamming distances problem asks to compute the Hamming distances between a given pattern of length $m$ and all length-$m$ substrings of a given text of length $n \ge m$. We focus on the $k$-mismatch version of the problem, where a distance needs to be returned only if it does not exceed a threshold $k$. We assume $n \le 2m$ (in general, one can partition the text into overlapping blocks). In this work, we show data structures for the dynamic version of this problem supporting two operations: An update performs a single-letter substitution in the pattern or the text, and a query, given an index $i$, returns the Hamming distance between the pattern and the text substring starting at position $i$, or reports that it exceeds $k$. First, we show a data structure with $\tilde{O}(1)$ update and $\tilde{O}(k)$ query time. Then we show that $\tilde{O}(k)$ update and $\tilde{O}(1)$ query time is also possible. These two provide an optimal trade-off for the dynamic $k$-mismatch problem with $k \le \sqrt{n}$: we prove that, conditioned on the strong 3SUM conjecture, one cannot simultaneously achieve $k^{1-\Omega(1)}$ time for all operations. For $k \ge \sqrt{n}$, we give another lower bound, conditioned on the Online Matrix-Vector conjecture, that excludes algorithms taking $n^{1/2-\Omega(1)}$ time per operation. This is tight for constant-sized alphabets: Clifford et al. (STACS 2018) achieved $\tilde{O}(\sqrt{n})$ time per operation in that case, but with $\tilde{O}(n^{3/4})$ time per operation for large alphabets. We improve and extend this result with an algorithm that, given $1 \le x \le k$, achieves update time $\tilde{O}(\frac{n}{k} + \sqrt{\frac{nk}{x}})$ and query time $\tilde{O}(x)$. In particular, for $k \ge \sqrt{n}$, an appropriate choice of $x$ yields $\tilde{O}(\sqrt[3]{nk})$ time per operation, which is $\tilde{O}(n^{2/3})$ when no threshold $k$ is provided.
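
    The interface of the dynamic problem (single-letter updates plus thresholded queries) can be sketched with a trivial baseline that stores both strings and compares letters on demand. It has constant-time updates but a query may scan the whole pattern, so it does not achieve any of the bounds above; the class and method names and the use of 0-based positions are assumptions.

```python
class NaiveKMismatch:
    """Baseline for dynamic k-mismatch: O(1) updates, O(m) worst-case queries."""

    def __init__(self, pattern, text, k):
        self.pattern = list(pattern)
        self.text = list(text)
        self.k = k

    def update_pattern(self, pos, ch):
        """Substitute the letter at 0-based position pos of the pattern."""
        self.pattern[pos] = ch

    def update_text(self, pos, ch):
        """Substitute the letter at 0-based position pos of the text."""
        self.text[pos] = ch

    def query(self, i):
        """Hamming distance between the pattern and text[i:i+m],
        or None if it exceeds the threshold k."""
        m = len(self.pattern)
        if i < 0 or i + m > len(self.text):
            raise IndexError("alignment falls outside the text")
        mismatches = 0
        for a, b in zip(self.pattern, self.text[i:i + m]):
            if a != b:
                mismatches += 1
                if mismatches > self.k:
                    return None    # stop early once the threshold is exceeded
        return mismatches


ds = NaiveKMismatch("abc", "abdabc", k=1)
print(ds.query(0), ds.query(1), ds.query(3))
ds.update_text(2, "c")
print(ds.query(0))
```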